This is my course page (Juha/monnis). Here is the link to my github repository.
This link leads to the R Markdown cheat sheet, which should be useful.
Part 0: Libraries used and overall setup
Disclaimer: this first part looks a bit awful, but bear with me.
rm(list=ls())
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
Part 1: The data
Structure of the data and the dimensions of the data
A sample of students answered questions about how they learn, along with some background characteristics (e.g. age, gender). There is a grand total of 166 observations and 7 variables: one factor variable for gender; three integer variables for age, attitude and exam points; and three numeric variables for indexes based on questions about learning methods. More precisely, the indexes measure whether a student's learning exhibits signs of so-called strategic, deep and/or surface learning.
setwd("~/GitHub/IODS-project/data")
students2014 <- read.table("learning2014.txt")
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
dim(students2014)
## [1] 166 7
Part 2: Overview of the data
The students appear to be mostly female (110 as opposed to 56 males) and their median age is 22. Their point distribution is skewed to the left (the mean is below the median), which indicates that they are a quite capable lot. Attitude seems to be the most important factor in determining the points, with a whopping correlation of 0.437. Out of the learning indexes, strategic learning correlates positively with the points received, while surface learning's correlation is of similar magnitude but negative. The deep learning index matters the least, with close to zero correlation. Thus, it seems that the points earned favour strategic learning.
summary(students2014)
## gender Age Attitude deep stra
## F:110 Min. :17.00 Min. :14.00 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :32.00 Median :3.667 Median :3.188
## Mean :25.51 Mean :31.43 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :50.00 Max. :4.917 Max. :5.000
## surf Points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
ggpairs(students2014, lower = list(combo = wrap("facethist", bins = 20)))
Part 3: Linear regression model
My chosen variables are attitude, the strategic learning index (stra) and the surface learning index (surf).
Out of the regressors, only attitude is statistically significant (***), which means that we can REJECT the null hypothesis (the null hypothesis presumes that the chosen regressor's effect is zero to begin with, i.e. that it has no effect on the dependent variable). However, we cannot reject the null hypotheses for the chosen learning indexes. Determining their role might require more observations or more variation within the data, or perhaps they simply have no predictive power for the chosen dependent variable. For now, we can only say that with the given data, we cannot observe a statistically significant relationship between points and the learning indexes.
As for the significant regressor, the results imply that one additional point on the attitude metric predicts roughly 0.34 additional points in the exam. The intercept is ~11, the model's baseline: the points that even a person with an attitude score of zero would get. However, this is a hypothetical case, as the minimum attitude score in the sample is 14 points.
Linear_regression <- lm(Points ~ Attitude + stra + surf, data = students2014)
summary(Linear_regression)
##
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## Attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra 0.85313 0.54159 1.575 0.11716
## surf -0.58607 0.80138 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Part 4: Variation of the model
Removing the nonsignificant variables.
For the model parameters, the results imply that one additional point on the attitude metric predicts roughly 0.35 points in the exam. The intercept is ~11.6, the model's baseline: the points that even a person with an attitude score of zero would get. However, this is a hypothetical case, as the minimum attitude score in the sample is 14 points. Removing the nonsignificant independent variables does not change the results much.
The R squared is now 0.1906. This is the share of the total variation that the linear model explains, compared to a model with no independent variables. In other words, we compare the residuals of a "baseline model" to the residuals of our fitted model. The baseline model is a prediction-wise useless model which always predicts the mean of the observed dependent variable and does not use the independent variables at all. So basically, R squared tells how much more of the total variance the chosen independent variables explain compared to the baseline model. In the optimal scenario, an R squared of one, the data points would overlap the regression line perfectly.
Ultimately, it must be noted that R squared alone is not enough to indicate whether the model is bad (or good): for instance, a low R squared can be totally understandable considering the complexity of the statistical relationship we are trying to model. Is this a big R squared? To the best of my knowledge, in the social sciences a low R squared can be justified, and this is not even that low. It all depends on the context: modelling the weather, macroeconomic fluctuations or the role of certain genes in the onset of psychosis are bound to yield very different coefficients of determination.
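To make the baseline-model comparison concrete, here is a small sketch computing R squared by hand from the two sets of residuals. It uses R's built-in mtcars data as a stand-in (the learning2014 file is not bundled with R); the logic is identical for any lm fit.

```r
# Fit a simple linear model on a built-in dataset (stand-in for students2014)
fit <- lm(mpg ~ wt, data = mtcars)

# Residual sum of squares of the fitted model
rss <- sum(residuals(fit)^2)
# Residual sum of squares of the "baseline model" that always predicts the mean
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# R squared: the share of the baseline variation that the model explains away
r2_manual <- 1 - rss / tss
all.equal(r2_manual, summary(fit)$r.squared)  # TRUE
```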
Linear_regression2 <- lm(Points ~ Attitude, data = students2014)
summary(Linear_regression2)
##
## Call:
## lm(formula = Points ~ Attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Part 5: Diagnostics
Naturally, there are more or less crucial assumptions behind the linear regression model.
Analyzing the residuals of the model - the differences between the fitted values and the actual observations - gives insight into whether these assumptions are violated.
The QQ-plot explores the normality assumption. In this case, normality is reasonably well satisfied, with only some deviation from the line at both ends.
The residuals versus fitted values plot explores the constant variance assumption. As the name states, we look for any patterns across the plotted values. Nothing alarming here either: the points are spread out fairly randomly.
Finally, the leverage plot can help identify outliers' impact on the model parameters. Outliers are problematic as they often don't represent the sample and can pull the regression slope away from its "more true" course. Here, no outliers really stand out, which is great!
par(mfrow = c(2,2))
plot(Linear_regression2, c(1,2,5))
Part 0: Libraries used and overall setup
Disclaimer: this first part looks a bit awful, but bear with me.
rm(list=ls())
library(dplyr); library(ggplot2); library(GGally); library(readr); library(tidyr)
Part 2: The data
This data contains information about Portuguese students and their endeavours in life. We use it to analyze the subtle relationship between alcohol consumption and school performance. The variables are listed below.
setwd("~/GitHub/IODS-project/data")
alc <- read.table("alc.txt")
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Part 3: Choosing the variables
My variables of choice are:
absences - Which comes first: absences or alcohol use? Causality is a bit questionable, but one way to hypothesize this is that more hangovers (and as such absences) means more alcohol use. A positive correlation is likely.
romantic - People in romantic relationships tend to become ex-drinking buddies. This is an observation from life, not a hypothesis :D
famsize - Family size might be an unintuitive choice at first glimpse, but should one have many siblings, some of them might be older and thus able to purchase alcohol for their younger siblings. Personal experience tells me that this is one of the main supply channels of alcohol for young people.
studytime - Time spent studying is time away from drinking, just as time spent drinking is time away from studying. Causality is again a bit hard to pin down, but a negative correlation is likely.
Part 4: Exploring the chosen variables
The graphs indicate that over 2/3 of the students drink moderately. Two thirds are not romantically involved. Two thirds of the sample have more than 3 family members, and their weekly study time rarely exceeds 10 hours.
Regarding the cross-tabulations, I grouped moderate users (including teetotallers) and heavy users i) according to their family size and ii) according to their romantic status.
It seems that children in bigger families do not drink heavily: it is more common for children in smaller families to consume high amounts of alcohol. On average, the heavy users from bigger families are less absent than heavy users from smaller families. The heavy users' average study time is smaller than that of the students who drink moderately, but there seems to be no big (or interesting) difference regarding family size. My hypothesis regarding family size seems highly questionable.
Being in a romantic relationship seems to be associated with lower levels of alcohol consumption, judging by the case counts. Makes sense regarding my hypothesis. Among non-relationship students there are 45 heavy users per 100 moderate users (81/180), as opposed to 37.5 per 100 (33/88) for the couples-people.
For persons in a romantic relationship, there seem to be more absences among both the moderate and heavy drinkers compared to non-relationship persons (probably spending time together, awww). When romantic status is combined with high levels of alcohol use, the mean absences get even bigger (maybe they are drinking together more often?).
It must be noted that heavy drinkers might not be that great significant-other material: they might be more interested in partying than dating. As such, there might be selection problems and numerous other mechanisms/channels that I haven't even thought of.
chosenvars <- c("absences", "romantic", "famsize", "studytime", "high_use", "sex")
alc2 <- select(alc, one_of(chosenvars))
str(alc2)
## 'data.frame': 382 obs. of 6 variables:
## $ absences : int 5 3 8 1 2 8 0 4 0 0 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ studytime: int 2 2 2 3 2 2 2 2 2 2 ...
## $ high_use : logi FALSE FALSE TRUE FALSE FALSE FALSE ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
gather(alc2) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
alc %>% group_by(famsize, high_use) %>% summarise(count = n(), mean_absences = mean(absences), mean_studytime = mean(studytime))
## # A tibble: 4 x 5
## # Groups: famsize [?]
## famsize high_use count mean_absences mean_studytime
## <fct> <lgl> <int> <dbl> <dbl>
## 1 GT3 FALSE 201 3.75 2.17
## 2 GT3 TRUE 77 6.05 1.79
## 3 LE3 FALSE 67 3.58 2.07
## 4 LE3 TRUE 37 7.03 1.73
alc %>% group_by(romantic, high_use) %>% summarise(count = n(), mean_absences = mean(absences), mean_studytime = mean(studytime))
## # A tibble: 4 x 5
## # Groups: romantic [?]
## romantic high_use count mean_absences mean_studytime
## <fct> <lgl> <int> <dbl> <dbl>
## 1 no FALSE 180 3.41 2.16
## 2 no TRUE 81 5.78 1.69
## 3 yes FALSE 88 4.32 2.12
## 4 yes TRUE 33 7.82 1.97
Part 5: Logistic regression to the rescue
It must be noted that intuition tells me the regression equation is "a bit dirty". Study time and absences are probably correlated as heck, and combining an alcohol use indicator with absences and study time sounds like an unholy triangle for statistical inference.
Interpretation of the summary and the odds ratios
Of my variables, study time and absences are statistically significant. More time spent studying indicates less heavy boozing, and more absences probably indicate recovering from said heavy boozing (or something else; this is just a hypothesis).
The coefficient for study time is -0.53 (OR: 0.59) and the coefficient for absences 0.08 (OR: 1.09). The raw coefficients are not that intuitive: the signs [+,-] and the magnitudes are somewhat informative, but by exponentiating them we turn them into odds ratios.
For example, the odds ratio of 0.59 for study time indicates that a jump to a higher study-time category (this is a categorical variable taking the values 1-4) decreases the odds of drinking heavily. Vice versa, a jump to a category of students who study less makes the odds of drinking heavily 1/0.59, i.e. about 1.69 times bigger. For absences, one extra absence makes the odds of being a heavy drinker about 1.09 times bigger.
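As a sanity check, the odds ratios follow mechanically from the raw coefficients quoted above:

```r
# Log-odds coefficients for studytime and absences, as quoted above
b_studytime <- -0.53031
b_absences  <-  0.08236

or_studytime <- exp(b_studytime)  # ~0.59: higher study-time category, lower odds of heavy use
or_inverse   <- 1 / or_studytime  # ~1.69: the same effect read in the other direction
or_absences  <- exp(b_absences)   # ~1.09: each extra absence multiplies the odds by ~1.09
```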
m <- glm(high_use ~ absences + romantic + famsize + studytime, data = alc, family = "binomial")
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + romantic + famsize + studytime,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1302 -0.8348 -0.6421 1.1824 2.2477
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.22264 0.35314 -0.630 0.528393
## absences 0.08236 0.02300 3.581 0.000342 ***
## romanticyes -0.26346 0.25913 -1.017 0.309299
## famsizeLE3 0.31521 0.25692 1.227 0.219867
## studytime -0.53031 0.15542 -3.412 0.000645 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 431.24 on 377 degrees of freedom
## AIC: 441.24
##
## Number of Fisher Scoring iterations: 4
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.8004009 0.3996898 1.6005996
## absences 1.0858500 1.0402630 1.1385241
## romanticyes 0.7683888 0.4578880 1.2679157
## famsizeLE3 1.3705508 0.8239609 2.2609734
## studytime 0.5884228 0.4296281 0.7912992
Part 6: Exploring the predictive power
Well, it seems that my model is not that convincing.
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability>0.5)
## Warning: package 'bindrcpp' was built under R version 3.5.1
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 257 11
## TRUE 97 17
g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))
g <- g +geom_point()
g
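To put a number on "not that convincing", the confusion matrix above can be condensed into a single training error rate (the cell counts are copied from the table above):

```r
# 2x2 confusion table: rows = actual high_use, columns = prediction
conf <- matrix(c(257, 97, 11, 17), nrow = 2,
               dimnames = list(high_use   = c("FALSE", "TRUE"),
                               prediction = c("FALSE", "TRUE")))

# Misclassified cases: false positives (11) plus false negatives (97)
wrong <- conf["FALSE", "TRUE"] + conf["TRUE", "FALSE"]
error_rate <- wrong / sum(conf)
error_rate  # 108/382, i.e. about 28% of the students are misclassified
```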
Part 1: Libraries used and overall setup
rm(list=ls())
library(dplyr); library(ggplot2); library(GGally); library(readr); library(tidyr); library(MASS); library(corrplot)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## corrplot 0.84 loaded
Part 2: The Boston housing dataset
The dataset concerns housing values in the suburbs of Boston and contains aggregated data about median housing prices, per capita crime levels and other possibly relevant things. The data is aggregated at the suburb level; in total there are 506 observations (suburbs) and 14 variables, all of which are either numeric or integer.
In order to analyze the median housing prices, the so-called hedonic pricing method can be used. Hedonic pricing is a highly influential idea developed by the late economist Sherwin Rosen in the 1970s: the price of a good can be derived from its characteristics. Of course, it can also be turned the other way around, to analyze crime rates for instance, as we are doing here.
Link to the data and its source information: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
data("Boston")
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Part 3: The variables in the data
Below are summaries and distributions of the variables as well as a correlation matrix. I comment mostly on the things that appear interesting to me. The proportion of houses built before 1940 is skewed to the left, so there are few entirely new suburbs; well, Boston is an old city. Most of the suburbs are somewhat close to the employment centers of Boston, which makes sense. For the per capita crime levels, there seems to be an absurd level of crime in one suburb, but most suburbs are relatively crime-free. The percentage of lower-status population is skewed to the right, which makes sense, as Boston has always had that upper-middle-class sound to it. Lastly, the median (owner-occupied) home value is concentrated below 30k USD with a tail to the right. However, there is a noticeable spike at the 50k USD mark.
As for the correlations, crime correlates negatively with housing prices and positively with accessibility to radial highways, the property tax rate and the percentage of lower-status population. The median price correlates quite negatively with the percentage of lower-status population, which is intuitive. Proximity to Charles River does not correlate much with, well, anything. This is somewhat surprising, as I would have imagined it would correlate with housing prices and age at least.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
gather(Boston) %>% ggplot(aes(value)) + facet_wrap(~key, scales="free") + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
cor_matrix <- cor(Boston) %>% round(2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
corrplot(cor_matrix, method="circle", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
Part 4: Standardizing the variables and other stuff
Standardizing variables or an entire dataset can be done for numerous reasons, e.g. for the sake of easier comparability. After the scaling, the shapes of the variables' distributions look exactly the same, but they are rescaled to have a mean of zero and a standard deviation of one.
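Concretely, scale() subtracts each column's mean and divides by its standard deviation; a minimal sketch verifying this on a toy vector:

```r
# Standardization by hand: (x - mean(x)) / sd(x)
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)

# scale() returns a matrix, so flatten it before comparing
all.equal(as.vector(scale(x)), manual)  # TRUE
```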
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
boston_scaled <- as.data.frame(boston_scaled)
gather(boston_scaled) %>% ggplot(aes(value)) + facet_wrap(~key, scales="free") + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
label <- c("low","med_low","med_high","high")
crime <- cut(boston_scaled$crim, breaks = bins, labels = label, include.lowest = TRUE)
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
Part 5: Linear discriminant analysis
What we can see from the LDA plot is that accessibility to radial highways (rad) is an important feature contributing to LD1, and it is not closely correlated with the other features. The length of the arrow tells us that it is a strong discriminator for identifying the high-crime suburbs. LD2 seems to separate the other crime-level suburbs.
lda.fit <- lda(crime ~ ., data = train)
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes )
lda.arrows(lda.fit, myscale = 2)
Part 6: Predicting with the model
Our model predicts a suburb belonging to the high-crime class extremely well, but the lower crime classes leave something to be desired. Especially med_low tends to get classified as med_high. But overall, not too bad!
The reason why high-crime suburbs are classified so well by the model can be seen from the LDA plot: accessibility to radial highways seems to be a huge and distinctive factor in determining these suburbs. "Radan varrella sattuu ja tapahtuu" ("along the tracks, things happen"), as some Finns might say.
Music recommendation: “Kuustoista kilsaa Kontulaan”
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 18 6 3 0
## med_low 6 14 11 0
## med_high 0 6 11 2
## high 0 0 0 25
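To put a single number on "not too bad", the overall accuracy can be read off the diagonal of the cross-tabulation (cell counts copied from the table above; note that the train/test split is random, so the exact numbers vary between runs):

```r
# Cross-tabulation: rows = correct class, columns = predicted class
tab <- matrix(c(18,  6,  3,  0,
                 6, 14, 11,  0,
                 0,  6, 11,  2,
                 0,  0,  0, 25),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

# Correctly classified suburbs sit on the diagonal
accuracy <- sum(diag(tab)) / sum(tab)
accuracy  # 68/102, i.e. about two thirds classified correctly
```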
Part 7: Cluster analysis
The optimal number of clusters seems to be 2, judging by where the total within-cluster sum of squares drops most sharply. The model suggests that there are two kinds of suburbs in the city of Boston.
A glimpse at the plots reveals that the black cluster is characterized by more crime, more pollution, closer proximity to radial highways and greater distance from employment centers. The red cluster is in many ways the opposite. Or that's what my intuition tells me!
Since cluster analysis usually relies on the imagination of its user, and Boston is a reputable city of the Irish, I'm going to name these two suburb categories "Carrot Top Hoods" (black cluster) and "True Kelt Dwellings" (red cluster).
data("Boston")
boston_scaled2 <- scale(Boston)
boston_scaled2 <- as.data.frame(boston_scaled2)
dist_eu <- dist(boston_scaled2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# k-means clustering
km <- kmeans(boston_scaled2, centers = 3)
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
km <- kmeans(boston_scaled2, centers = 2)
pairs(boston_scaled2[6:10], col = km$cluster)
pairs(boston_scaled2[1:5], col = km$cluster)
Part 0: Libraries used and overall setup
rm(list=ls())
library(dplyr); library(ggplot2); library(GGally); library(readr); library(tidyr); library(corrplot); library(FactoMineR)
Part 1: Making sense of the data
I'll start by exploring the female-to-male ratios of labour force participation and secondary educational attainment. Regarding the secondary education ratio, there are some countries where there are more females with secondary education than males, but overall the ratio usually lies somewhere between 0.8 and 1.0 (a value of 1 would mean equal numbers of females and males with secondary education). For the labour force participation ratio, the majority of countries lie between 0.5 and 1, but the distribution is more uneven than the secondary education ratio's, and there are fewer cases where females outnumber males. For other participatory measures, females are less likely to be members of parliament, with the mean share across countries being just 20.91 percent.
As for the other variables, maternal mortality has some drastic outliers, as does life expectancy. A likely explanation for high maternal mortality and low life expectancy is poverty (and warfare, but we cannot identify that from the data), as there are clear negative correlations between GNI and both maternal mortality and the adolescent birth rate.
As for the other correlations, life expectancy correlates strongly (+) with expected education years, and it also has strong negative correlations with the maternal mortality rate and the adolescent birth rate. This makes sense, as maternal mortality and the adolescent birth rate in turn correlate strongly (+) with each other.
setwd("~/GitHub/IODS-project/data")
human <- read.table("human.txt")
summary(human)
## ed2r_FM lfr_FM Life_exp Educ_exp
## Min. :0.1717 Min. :0.1857 Min. :49.00 Min. : 5.40
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:66.30 1st Qu.:11.25
## Median :0.9375 Median :0.7535 Median :74.20 Median :13.50
## Mean :0.8529 Mean :0.7074 Mean :71.65 Mean :13.18
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:77.25 3rd Qu.:15.20
## Max. :1.4967 Max. :1.0380 Max. :83.50 Max. :20.20
## GNI Mat_mortr Adol_birthr Percent_MP_F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
gather(human) %>% ggplot(aes(value)) + facet_wrap(~key, scales="free") + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
M<-cor(human) %>% round(2)
M
## ed2r_FM lfr_FM Life_exp Educ_exp GNI Mat_mortr Adol_birthr
## ed2r_FM 1.00 0.01 0.58 0.59 0.43 -0.66 -0.53
## lfr_FM 0.01 1.00 -0.14 0.05 -0.02 0.24 0.12
## Life_exp 0.58 -0.14 1.00 0.79 0.63 -0.86 -0.73
## Educ_exp 0.59 0.05 0.79 1.00 0.62 -0.74 -0.70
## GNI 0.43 -0.02 0.63 0.62 1.00 -0.50 -0.56
## Mat_mortr -0.66 0.24 -0.86 -0.74 -0.50 1.00 0.76
## Adol_birthr -0.53 0.12 -0.73 -0.70 -0.56 0.76 1.00
## Percent_MP_F 0.08 0.25 0.17 0.21 0.09 -0.09 -0.07
## Percent_MP_F
## ed2r_FM 0.08
## lfr_FM 0.25
## Life_exp 0.17
## Educ_exp 0.21
## GNI 0.09
## Mat_mortr -0.09
## Adol_birthr -0.07
## Percent_MP_F 1.00
corrplot(M, type = "upper", method="square", cl.pos = "b", tl.pos = "d", tl.cex = 0.5)
Part 2: PCA with non-standardized variables
[Comparison of non-standardized and standardized PCAs under Part 3.]
pca_human <- prcomp(human)
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.5, 0.7), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
Part 3: Comparison between PCAs: non-standardized vs standardized variables
PCA does not work that well if the variables are not standardized. This is due to variable scaling: GNI in particular varies on a far larger scale than the other variables, which makes the analysis futile. As such, there is not much sense in trying to decipher these results, beyond understanding that the analysis is ruined by one rampantly varying variable.
My face whilst trying to decipher the results
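To see concretely why one large-scale variable wrecks unscaled PCA, here is a minimal sketch on simulated toy data (not the human data): a variable on a GNI-like scale swallows essentially all of the variance.

```r
# Toy illustration: unscaled PCA is dominated by whichever variable
# has the largest variance.
set.seed(1)
toy <- data.frame(
  small = rnorm(100, sd = 1),     # unit-scale variable
  huge  = rnorm(100, sd = 1000)   # GNI-like scale
)
pca_toy <- prcomp(toy)            # no standardization
round(summary(pca_toy)$importance[2, ], 3)  # PC1 captures essentially 100%
```

Standardizing first (prcomp(toy, scale. = TRUE)) spreads the variance back across the components.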
After we have standardized the variables, the results are very different and the biplot and the components seem meaningful.
A word about PCA, if I may. If we have a ton of variables, many of them may measure related attributes and as such be “redundant” in the analysis. In other words, they can measure the same phenomena, with the same underlying things affecting them. For instance, intuition says that the female-to-male labour force participation rate and the percentage of female members of parliament are closely correlated outcomes determined by similar things in a society (religion, empowerment of women, etc.) and as such, in general, they measure the same phenomenon. In general, that is; of course there might be drastic exceptions.
Based on the original variables, PCA constructs new variables that capture the maximum amount of variance in the original data, and these can be used to reduce the number of dimensions in a statistical analysis. To be precise, the first component captures the maximum amount of variance, the second captures the maximum amount of the variance left after the first, and so on. The principal components are uncorrelated, which is useful, since the original data can be a real cluster of heavily correlated variables. In this case, the two principal components depicted in the biplot capture 69.8% of the variance between countries.
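As a numerical sanity check of these claims (a sketch on simulated toy data, not the human data): the component scores returned by prcomp are mutually uncorrelated, and the variance proportions are just the eigenvalues of the correlation matrix, normalized.

```r
# Simulated, deliberately correlated toy data.
set.seed(2)
X <- matrix(rnorm(400), ncol = 4) %*% matrix(runif(16), ncol = 4)
pca <- prcomp(X, scale. = TRUE)
round(cor(pca$x), 2)              # off-diagonals are zero: uncorrelated scores
ev <- eigen(cor(X))$values
all.equal(ev / sum(ev), pca$sdev^2 / sum(pca$sdev^2))  # TRUE: same proportions
```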
We can analyze the first two components in the following manner with the biplot below.
human_std <- scale(human)
summary(human_std)
## ed2r_FM lfr_FM Life_exp Educ_exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.3503 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 2.6646 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNI Mat_mortr Adol_birthr Percent_MP_F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
pca_humanstd <- prcomp(human_std)
z <- summary(pca_humanstd)
pca_prz <- round(100*z$importance[2, ], digits = 1)
# print out the percentages of variance
pca_prz
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
# create object pc_lab to be used as axis labels
pc_labz <- paste0(names(pca_prz), " (", pca_prz, "%)")
# draw a biplot
biplot(pca_humanstd, cex = c(0.5, 0.7), col = c("grey40", "deeppink2"), xlab = pc_labz[1], ylab = pc_labz[2])
Part 4: Personal interpretation of the biplot components
In this case, the two principal components depicted in the biplot above capture 69.8% of the variance between countries.
The components and their main variables (as taken from the biplot arrows’ directions):
Female-male labour force participation and the percentage of female members of parliament contribute most to PC2. I name this the FEMALE PARTICIPATION PRINCIPAL COMPONENT. Female participation in politics and in the labour market seems to be a differentiating factor between the countries.
Expected education years, GNI, life expectancy and the female-male secondary educational attainment ratio (all positively correlated with PC1), as well as the negatively correlated maternal mortality and adolescent birth rates, contribute most to PC1. I name this the OVERALL DEVELOPMENT PRINCIPAL COMPONENT.
So basically, we have the overall development and female participation components and they capture almost 70% of the variance between countries.
Part 5: Multiple correspondence analysis
Since there’s a load of variables, 36 to be exact, with a total of 300 observations, I’m going to limit the number of variables used in the analysis, as was done in the DataCamp exercises. All but one of the variables are factors with at least 2 levels; the only integer variable is age. So basically, we have some tea drinkers here and we are trying to summarize what differentiates their tea drinking.
Sooo, MCA is sort of like PCA, except that it is used for factor variables. The components explain variance in the same manner as with PCA. In the biplot, factor levels with similar profiles are in close proximity to each other. What I find interesting are the following combinations (and my interpretations):
*Teatime + pub + tea bag & unpackaged - Sounds like a highly British combination. Tea, teatime and the local village pub.
*Earl Grey + spirituality - If the nationality of the people that the sample covers is British, then this is an understandable combination! But for some reason, I would have anticipated green tea to be closer to spirituality. Maybe I don’t understand that much about tea drinking.
*Green tea + not tea time - Well sort of makes sense as green tea drinkers might not be the traditional British tea drinkers. Maybe I don’t understand anything about tea drinking.
By the way, MCA reminds me of Beastie Boys.
Music recommendation: “Sabotage”
data("tea")
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
summary(tea)
## breakfast tea.time evening lunch
## breakfast :144 Not.tea time:131 evening :103 lunch : 44
## Not.breakfast:156 tea time :169 Not.evening:197 Not.lunch:256
##
##
##
##
##
## dinner always home work
## dinner : 21 always :103 home :291 Not.work:213
## Not.dinner:279 Not.always:197 Not.home: 9 work : 87
##
##
##
##
##
## tearoom friends resto pub
## Not.tearoom:242 friends :196 Not.resto:221 Not.pub:237
## tearoom : 58 Not.friends:104 resto : 79 pub : 63
##
##
##
##
##
## Tea How sugar how
## black : 74 alone:195 No.sugar:155 tea bag :170
## Earl Grey:193 lemon: 33 sugar :145 tea bag+unpackaged: 94
## green : 33 milk : 63 unpackaged : 36
## other: 9
##
##
##
## where price age sex
## chain store :192 p_branded : 95 Min. :15.00 F:178
## chain store+tea shop: 78 p_cheap : 7 1st Qu.:23.00 M:122
## tea shop : 30 p_private label: 21 Median :32.00
## p_unknown : 12 Mean :37.05
## p_upscale : 53 3rd Qu.:48.00
## p_variable :112 Max. :90.00
##
## SPC Sport age_Q frequency
## employee :59 Not.sportsman:121 15-24:92 1/day : 95
## middle :40 sportsman :179 25-34:69 1 to 2/week: 44
## non-worker :64 35-44:40 +2/day :127
## other worker:20 45-59:61 3 to 6/week: 34
## senior :35 +60 :38
## student :70
## workman :12
## escape.exoticism spirituality healthy
## escape-exoticism :142 Not.spirituality:206 healthy :210
## Not.escape-exoticism:158 spirituality : 94 Not.healthy: 90
##
##
##
##
##
## diuretic friendliness iron.absorption
## diuretic :174 friendliness :242 iron absorption : 31
## Not.diuretic:126 Not.friendliness: 58 Not.iron absorption:269
##
##
##
##
##
## feminine sophisticated slimming
## feminine :129 Not.sophisticated: 85 No.slimming:255
## Not.feminine:171 sophisticated :215 slimming : 45
##
##
##
##
##
## exciting relaxing effect.on.health
## exciting :116 No.relaxing:113 effect on health : 66
## No.exciting:184 relaxing :187 No.effect on health:234
##
##
##
##
##
tea_time <- tea[,c("Tea", "how", "spirituality", "tearoom", "pub", "tea.time")]
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.258 0.214 0.183 0.165 0.144 0.134
## % of var. 19.364 16.050 13.696 12.401 10.789 10.054
## Cumulative % of var. 19.364 35.414 49.110 61.511 72.301 82.354
## Dim.7 Dim.8
## Variance 0.124 0.112
## % of var. 9.271 8.375
## Cumulative % of var. 91.625 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.527 0.359 0.275 | 0.502 0.392 0.249 | -0.323
## 2 | -0.527 0.359 0.275 | 0.502 0.392 0.249 | -0.323
## 3 | -0.175 0.040 0.060 | -0.061 0.006 0.007 | -0.464
## 4 | -0.515 0.342 0.299 | -0.558 0.485 0.352 | 0.123
## 5 | -0.515 0.342 0.299 | -0.558 0.485 0.352 | 0.123
## 6 | -0.596 0.459 0.598 | -0.130 0.026 0.028 | -0.245
## 7 | -0.175 0.040 0.060 | -0.061 0.006 0.007 | -0.464
## 8 | -0.106 0.015 0.012 | 0.571 0.508 0.352 | -0.542
## 9 | 0.204 0.054 0.056 | -0.032 0.002 0.001 | -0.071
## 10 | 0.454 0.266 0.108 | 0.617 0.593 0.200 | 0.278
## ctr cos2
## 1 0.190 0.103 |
## 2 0.190 0.103 |
## 3 0.394 0.423 |
## 4 0.027 0.017 |
## 5 0.027 0.017 |
## 6 0.110 0.101 |
## 7 0.394 0.423 |
## 8 0.536 0.317 |
## 9 0.009 0.007 |
## 10 0.141 0.040 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.255 1.039 0.021 2.527 | 1.204 27.840
## Earl Grey | 0.045 0.083 0.004 1.038 | -0.550 15.145
## green | -0.834 4.940 0.086 -5.071 | 0.516 2.281
## tea bag | -0.465 7.898 0.282 -9.188 | -0.214 2.019
## tea bag+unpackaged | 0.693 9.707 0.219 8.092 | -0.133 0.434
## unpackaged | 0.385 1.150 0.020 2.461 | 1.358 17.245
## Not.spirituality | -0.078 0.271 0.013 -2.001 | 0.372 7.400
## spirituality | 0.171 0.594 0.013 2.001 | -0.815 16.217
## Not.tearoom | -0.355 6.554 0.525 -12.531 | -0.046 0.135
## tearoom | 1.480 27.348 0.525 12.531 | 0.193 0.562
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.475 11.911 | -0.392 3.460 0.050 -3.879 |
## Earl Grey 0.545 -12.768 | -0.193 2.188 0.067 -4.483 |
## green 0.033 3.137 | 2.008 40.482 0.498 12.207 |
## tea bag 0.060 -4.230 | -0.438 9.920 0.251 -8.660 |
## tea bag+unpackaged 0.008 -1.558 | 0.571 9.318 0.149 6.668 |
## unpackaged 0.252 8.674 | 0.578 3.655 0.046 3.689 |
## Not.spirituality 0.303 9.522 | -0.295 5.469 0.191 -7.562 |
## spirituality 0.303 -9.522 | 0.647 11.986 0.191 7.562 |
## Not.tearoom 0.009 -1.635 | -0.103 0.775 0.044 -3.625 |
## tearoom 0.009 1.635 | 0.428 3.235 0.044 3.625 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.094 0.581 0.505 |
## how | 0.291 0.253 0.251 |
## spirituality | 0.013 0.303 0.191 |
## tearoom | 0.525 0.009 0.044 |
## pub | 0.220 0.129 0.026 |
## tea.time | 0.406 0.009 0.078 |
plot(mca, invisible=c("ind"), habillage = "quali")
Part 0: Libraries used and overall setup
rm(list=ls())
library(dplyr); library(ggplot2); library(GGally); library(readr); library(tidyr); library(lme4)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following object is masked from 'package:tidyr':
##
## expand
setwd("~/GitHub/IODS-project/data")
ratsl <- read.table("RATSL.txt")
bprsl <- read.table("BPRSL.txt")
ratsl$ID <- factor(ratsl$ID)
ratsl$Group <- factor(ratsl$Group)
I am quite slavishly following the chapters in the book. Like a rat following the pied piper.
The data used in this first part:
*Rats - Three different groups of rats (16 rats in total) are fed different diets, and the change in their weight is observed. The data is originally from Crowder & Hand (1990).
First, I plot the individual rat observations by time. Group signals the colour.
[Important notice: In Vehkalahti & Everitt (2018?), chapter 8, the layout was three graphs, one graph per group, but I chose to plot all the rats in the same graph and distinguish the groups by colour, as there are only 16 rats. The information provided by both approaches is roughly equal.]
Whoa, there is one huge rat in the second group! Otherwise, the initial levels of the rat weight and the slopes seem to be quite close together group-wise.
The biggest rodent in the world is capybara.
Capybaras are wicked cool animals.
Respected by all.
ggplot(ratsl, aes(x = Time, y = Weight, group=ID, col = Group)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
scale_y_continuous(name = "rat weight", limits = c(min(ratsl$Weight), max(ratsl$Weight)))
Now if we standardize the weights (below), the slopes become more equal. There seem to be no differences in the growth rates of the rats over time.
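As a side note on what scale() does per timepoint in the chunk below: it subtracts the mean and divides by the standard deviation. A tiny sketch with toy numbers (not the rat data):

```r
# scale() centers to mean 0 and divides by the sd, turning each
# timepoint's weights into comparable z-scores.
w <- c(240, 250, 260)    # toy weights: mean 250, sd 10
as.numeric(scale(w))     # -1 0 1
```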
ratsl <- ratsl %>%
group_by(Time) %>%
mutate(stdweight = scale(Weight)) %>%
ungroup()
ggplot(ratsl, aes(x = Time, y = stdweight, group=ID, col = Group )) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
scale_y_continuous(name = "Standardized rat weight")
Next, I plot how the mean weight has changed by group.
Not that much new information here: as anticipated, the standard errors are highest in the second group. Yup, the one with that huge rat.
n <- ratsl$Time %>% unique() %>% length()
ratsass <- ratsl %>%
group_by(Group, Time) %>%
summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n)) %>%
ungroup()
ggplot(ratsass, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
geom_point(size=3) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
theme(legend.position = c(0.9,0.5)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
Next, the summary measures approach in the form of boxplots: one box per group, excluding that one huge rat. The groups differ drastically.
ratsass2 <- ratsl %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise( mean=mean(Weight) ) %>%
ungroup()
# Glimpse the data
glimpse(ratsass2)
## Observations: 16
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, ...
str(ratsass2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 16 obs. of 3 variables:
## $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ mean : num 263 239 262 267 271 ...
ratsass3 <- filter(ratsass2,mean<550)
str(ratsass3)
## Classes 'tbl_df', 'tbl' and 'data.frame': 15 obs. of 3 variables:
## $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ mean : num 263 239 262 267 271 ...
ggplot(ratsass3, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), exluding initial Weight")
Next we check whether these differences are statistically significant. As there are three groups, ANOVA is used instead of a two-sample t-test. The null hypothesis is that the group means are the same, and we test it both for the average weight over the observation period (t > WD1) and for the baseline weight (t = WD1).
The assumptions are that the observations are independent of each other, the data in each group are normally distributed, and the groups share a common variance.
According to the ANOVA, the null hypothesis can only be rejected for the baseline weight. Only the baseline weight is significantly different; otherwise, whatever is fed to those poor rats is not working, which might be a good thing.
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt")
ratsass4 <- ratsass2 %>% mutate(baseline = RATS$WD1)
fit <- lm(mean ~ baseline + Group, data = ratsass4)
anova(fit)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 253625 253625 1859.8201 1.57e-14 ***
## Group 2 879 439 3.2219 0.07586 .
## Residuals 12 1636 136
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The data used in this second part:
*Bprs - 40 males are divided into treatment and control groups, 20 males in each (by the way, I call treatment=2 the treatment group and treatment=1 the control group). The brief psychiatric rating scale (BPRS) scores of these males are measured weekly for a total of 8 weeks. BPRS is a rating scale that measures different psychiatric symptoms (Wikipedia). The data is originally from Davis (2002).
A few words about multilevel modeling, if I may. A basic way to distinguish between multilevel models is to categorize them into i) random intercept models, ii) random slope models and iii) a combination of the two, random intercept + random slope models. Personally, I prefer the alternative names “varying intercept model” and “varying slope model”, as they are more intuitive, but that might just be me.
What does random or varying mean in this context? Traditionally, these models let us take into account that the data structure is nested (to my understanding, that is). One classic example of a nested data structure is GPAs of students from different classes and schools. A usual assumption of a regression model is that the observations are independent, but this might not be valid. With multilevel models we can allow heterogeneity in the slopes and/or intercepts between groups or individuals. For instance, the different classes and schools might affect the students' outcome variable in some way (their outcomes are correlated within schools or classes), and with multilevel models we can capture this sort of main effect and then analyze what kind of effects the different nested structures might have on the intercepts or slopes in the analysis.
It must be noted that here we use multilevel models to allow the slopes and intercepts to differ by individual rat. The main point is that traditional regressions assume that the observations are independent, which is a rather invalid assumption with a dependent variable like rat weight: weight at time=0 and weight at time=1 are unlikely to be independent observations. As such, we take the intraclass correlation of an individual rat's weight between different time points into account, making this a more robust form of analysis.
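To make the intraclass-correlation idea concrete, here is a small simulation sketch (hypothetical data and variable names, assuming lme4 is available): observations are generated with subject-specific intercept shifts, and a random intercept model recovers that subject-level variation.

```r
library(lme4)

set.seed(3)
# 20 subjects, 9 timepoints each: repeated measures, not independent rows.
sim <- expand.grid(subject = factor(1:20), time = 0:8)
u <- rnorm(20, sd = 6)   # true subject-level intercept shifts
sim$y <- 50 - 2 * sim$time + u[as.integer(sim$subject)] +
  rnorm(nrow(sim), sd = 5)

fit_ri <- lmer(y ~ time + (1 | subject), data = sim)
VarCorr(fit_ri)          # subject sd should land near the true value of 6
fixef(fit_ri)            # slope should land near the true value of -2
```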
On with the show! First I plot the BPRS scores for the treatment and control groups. It's quite a mess, to be honest. When plotting the individual time series, the use of two separate graphs for the control and treatment groups is justified, even though it's still kinda messy.
bprsl$treatment <- factor(bprsl$treatment)
ggplot(bprsl, aes(x = week, y = bprs, linetype = as.factor(subject))) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(bprsl$bprs), max(bprsl$bprs)))
Then a standard-issue linear regression is performed. The treatment is not statistically significant, while both groups have a declining trend over time. It must be noted that this regression suffers from the issues mentioned above and ignores the repeated-measures structure of the data, i.e. the correlation between an individual's observations. One possible source of that correlation might be an individual's brain chemistry, or whatever time-correlated unobservables there might be.
bprsl_reg <- lm(bprs ~ week + treatment, data = bprsl)
summary(bprsl_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = bprsl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
Random intercept model to the rescue! This means that we allow the intercept to vary for each individual male, so strict independence is no longer assumed for the observations. But this is likely not enough, as it is easy to imagine the slopes differing across individuals as well.
The estimated variance of the individual random effects is quite large, signaling variation in the fitted individual intercepts. The estimated parameters for treatment and week are similar to those in the linear regression above, except for the standard errors, which are a bit smaller. So controlling for the within-individual correlation does seem to pay off!
bprsl_int <- lmer(bprs ~ week + treatment + (1 | subject), data = bprsl, REML = FALSE)
summary(bprsl_int)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: bprsl
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
Random intercept and random slope model
As the name of the model states, each individual is now allowed to differ both in intercept and in slope. So heterogeneity all over the place is tolerated.
We compare the random intercept and the random intercept + slope models with an ANOVA table. The chi-squared statistic and p-value suggest that the random intercept + slope model prevails over the plain random intercept model (as DataCamp notes, a lower AIC indicates a better fit relative to the comparison model; the BIC actually slightly favours the simpler model here, but the likelihood-ratio test rejects it at the 5% level).
bprsl_intslope <- lmer(bprs ~ week + treatment + (week | subject), data = bprsl, REML = FALSE)
summary(bprsl_intslope)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: bprsl
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8202 8.0511
## week 0.9608 0.9802 -0.51
## Residual 97.4307 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
anova(bprsl_intslope, bprsl_int)
## Data: bprsl
## Models:
## bprsl_int: bprs ~ week + treatment + (1 | subject)
## bprsl_intslope: bprs ~ week + treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## bprsl_int 5 2748.7 2768.1 -1369.4 2738.7
## bprsl_intslope 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636
##
## bprsl_int
## bprsl_intslope *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
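As a side note, the likelihood-ratio test in that table can be reproduced by hand: the chi-squared statistic is the difference in deviances, compared against a chi-squared distribution with df equal to the difference in parameter counts (7 - 5 = 2).

```r
# Deviances taken from the anova table above.
chisq <- 2738.7 - 2731.4
round(pchisq(chisq, df = 2, lower.tail = FALSE), 3)  # ~0.026, matching Pr(>Chisq)
```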
Random intercept and random slope model with a time × group interaction
The interaction is not statistically significant at the 5% level, although some general remarks can be made. Overall, the BPRS scores go down for both groups, but for males in the treatment group the pace (or slope) of the decline is slower. Something might be going on there, but this would probably need more individuals and more observations for statistical power.
Or the results could just be random walk of sorts: Replication Crisis.
bprsl_intslopeX <- lmer(bprs ~ week + treatment + week*treatment + (week | subject), data = bprsl, REML = FALSE)
Fitted <- fitted(bprsl_intslopeX)
bprsl <- bprsl %>% mutate(Fitted)
summary(bprsl_intslopeX)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + week * treatment + (week | subject)
## Data: bprsl
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0767 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 65.0016 8.0624
## week 0.9688 0.9843 -0.51
## Residual 96.4699 9.8219
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2522 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
Comparison: observed vs fitted BPRS by group, to finish! Every male has his own slope and intercept. Seems cool!
Thanks for reading this far! Have a great remainder of the year and if things get stressful and hectic, remember the capybaras!
bprsl <- mutate(bprsl, subject2 = ifelse(bprsl$treatment==2, bprsl$subject+20, bprsl$subject))
ggplot(bprsl, aes(x = week, y = bprs, group = subject2)) +
geom_line(aes(linetype = treatment)) +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
scale_y_continuous(name = "BPRS") +
theme(legend.position = "top")
ggplot(bprsl, aes(x = week, y = Fitted, group = subject2)) +
geom_line(aes(linetype = treatment)) +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ treatment, labeller = label_both) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
scale_y_continuous(name = "Fitted BPRS") +
theme(legend.position = "top")